MLS: A Large-Scale Multilingual Dataset for Speech Research
This paper introduces the Multilingual LibriSpeech (MLS) dataset, a large
multilingual corpus suitable for speech research. The dataset is derived from
read audiobooks from LibriVox and consists of 8 languages, including about
44.5K hours of English and a total of about 6K hours for the other languages.
Additionally, we provide Language Models (LM) and baseline Automatic Speech
Recognition (ASR) models for all the languages in our dataset. We believe
such a large transcribed dataset will open new avenues in ASR and
Text-To-Speech (TTS) research. The dataset will be made freely available to
anyone at http://www.openslr.org.
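
As a quick illustration of working with the corpus, here is a minimal,
hedged sketch that streams a few MLS samples through the Hugging Face
datasets library. The hub ID "facebook/multilingual_librispeech" is an
assumed mirror (the canonical download is the OpenSLR link above), and
field names can differ between mirrors, so the sketch prints the keys
instead of hard-coding them.

    # Hypothetical sketch: streaming MLS examples with Hugging Face `datasets`.
    # The hub ID below is an assumed mirror; the canonical source is OpenSLR.
    from datasets import load_dataset

    # Streaming avoids downloading the full corpus up front.
    mls_de = load_dataset(
        "facebook/multilingual_librispeech",
        "german",          # one of the 8 MLS languages
        split="test",
        streaming=True,
    )

    for sample in mls_de.take(3):
        # Field names (e.g. "transcript" vs. "text") vary between mirrors,
        # so inspect the keys rather than assuming them.
        print(sorted(sample.keys()))
        print(sample["audio"]["sampling_rate"])
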
wav2letter++: The Fastest Open-source Speech Recognition System
This paper introduces wav2letter++, the fastest open-source deep learning
speech recognition framework. wav2letter++ is written entirely in C++, and uses
the ArrayFire tensor library for maximum efficiency. Here we explain the
architecture and design of the wav2letter++ system and compare it to other
major open-source speech recognition systems. In some cases, wav2letter++ is
more than 2x faster than other optimized frameworks for training end-to-end
neural networks for speech recognition. We also show that wav2letter++'s
training times scale linearly to 64 GPUs, the highest we tested, for models
with 100 million parameters. High-performance frameworks enable fast iteration,
which is often a crucial factor in successful research and model tuning on new
datasets and tasks.
Scaling Speech Technology to 1,000+ Languages
Expanding the language coverage of speech technology has the potential to
improve access to information for many more people. However, current speech
technology is restricted to about one hundred languages, which is a small
fraction of the over 7,000 languages spoken around the world. The Massively
Multilingual Speech (MMS) project increases the number of supported languages
by 10-40x, depending on the task. The main ingredients are a new dataset
based on readings of publicly available religious texts and the effective use
of self-supervised learning. We built pre-trained wav2vec 2.0 models covering
1,406 languages, a single multilingual automatic speech recognition model for
1,107 languages, speech synthesis models for the same number of languages, as
well as a language identification model for 4,017 languages. Experiments show
that our multilingual speech recognition model more than halves the word error
rate of Whisper on 54 languages of the FLEURS benchmark while being trained on
a small fraction of the labeled data.
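
To make the released artifacts concrete, the hedged sketch below loads the
public MMS checkpoint "facebook/mms-1b-all" through Hugging Face
Transformers and runs a greedy CTC decode; the adapter-switching calls
reflect the Transformers MMS integration, not code from the paper itself,
and the input waveform is a placeholder.

    # Hedged sketch: MMS ASR inference via Hugging Face Transformers.
    # Requires: pip install torch transformers
    import torch
    from transformers import AutoProcessor, Wav2Vec2ForCTC

    model_id = "facebook/mms-1b-all"  # released 1,107-language ASR checkpoint
    processor = AutoProcessor.from_pretrained(model_id)
    model = Wav2Vec2ForCTC.from_pretrained(model_id)

    # Swap in the language-specific adapter and vocabulary, e.g. French.
    processor.tokenizer.set_target_lang("fra")
    model.load_adapter("fra")

    # Placeholder input: one second of 16 kHz silence stands in for speech.
    waveform = torch.zeros(16000)
    inputs = processor(waveform.numpy(), sampling_rate=16000,
                       return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits
    ids = torch.argmax(logits, dim=-1)[0]
    print(processor.decode(ids))
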
TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch
TorchAudio is an open-source audio and speech processing library built for
PyTorch. It aims to accelerate the research and development of audio and speech
technologies by providing well-designed, easy-to-use, and performant PyTorch
components. Its contributors routinely engage with users to understand their
needs and fulfill them by developing impactful features. Here, we survey
TorchAudio's development principles and contents and highlight key features we
include in its latest version (2.1): self-supervised learning pre-trained
pipelines and training recipes, high-performance CTC decoders, speech
recognition models and training recipes, advanced media I/O capabilities, and
tools for performing forced alignment, multi-channel speech enhancement, and
reference-less speech assessment. For a selection of these features, through
empirical studies, we demonstrate their efficacy and show that they achieve
competitive or state-of-the-art performance.
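
As a concrete taste of the bundled pipeline API mentioned above, the short
sketch below runs a pre-trained ASR bundle with a simple greedy CTC decode;
the audio path is a placeholder, and the high-performance beam-search
decoder in torchaudio.models.decoder would be the natural replacement for
the greedy step.

    # Sketch: ASR with a TorchAudio pre-trained pipeline and greedy CTC decode.
    # Requires: pip install torch torchaudio
    import torch
    import torchaudio

    bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
    model = bundle.get_model()

    # "speech.wav" is a placeholder for any mono recording.
    waveform, sample_rate = torchaudio.load("speech.wav")
    if sample_rate != bundle.sample_rate:
        waveform = torchaudio.functional.resample(
            waveform, sample_rate, bundle.sample_rate
        )

    with torch.inference_mode():
        emissions, _ = model(waveform)  # frame-level label log-probs

    # Greedy CTC decoding: best label per frame, collapse repeats, drop the
    # blank token (index 0 in this bundle's label set).
    indices = torch.unique_consecutive(torch.argmax(emissions[0], dim=-1))
    labels = bundle.get_labels()
    transcript = "".join(labels[i] for i in indices if i != 0)
    print(transcript.replace("|", " "))  # "|" is the word boundary label
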
Performance and Efficiency Evaluation of ASR Inference on the Edge
Automatic speech recognition, the process of converting speech signals to
text, has improved a great deal in the past decade thanks to deep-learning-based
systems. With the latest transformer-based models, the recognition accuracy,
measured as word error rate (WER), is even below the human annotator error
(4%). However, most of these advanced models run on big servers with large
amounts of memory and CPU/GPU resources, and have a huge carbon footprint.
This server-based architecture of ASR is not viable in the long run given the
inherent lack of privacy for user data and the reliability and latency issues
of the network connection. On the other hand, on-device ASR (that is,
speech-to-text conversion on the edge device itself) fixes deep-rooted privacy
issues while at the same time being more reliable and performant by avoiding
network connectivity to the back-end server. On-device ASR can also lead to a
more sustainable solution by considering the energy vs. accuracy trade-off and
choosing the right model for the specific use cases/applications of the
product. Hence, in this paper we evaluate the energy-accuracy trade-off of ASR
with a typical transformer-based speech recognition model on an edge device.
We ran evaluations on a Raspberry Pi with an off-the-shelf USB meter for
measuring energy consumption. We conclude that, in the case of CPU-based ASR
inference, the energy consumption grows exponentially as the word error rate
improves linearly. Additionally, based on our experiments we deduce that, with
PyTorch mobile optimization and quantization, a typical transformer-based ASR
model on the edge performs reasonably well in terms of accuracy and latency,
and comes close to the accuracy of server-based inference.
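
The paper's exact model and scripts are not shown here, but the following
sketch illustrates what the PyTorch mobile optimization and quantization
step typically looks like, applied to a stand-in transformer encoder; all
names, shapes, and hyperparameters are illustrative assumptions.

    # Sketch: dynamic quantization + mobile optimization of a stand-in model.
    # Requires: pip install torch
    import torch
    from torch.utils.mobile_optimizer import optimize_for_mobile

    # Stand-in for a transformer-based acoustic model (not the paper's own).
    model = torch.nn.TransformerEncoder(
        torch.nn.TransformerEncoderLayer(d_model=256, nhead=4,
                                         batch_first=True),
        num_layers=4,
    ).eval()

    # Dynamic quantization: Linear weights stored as int8, activations
    # quantized on the fly -- smaller and faster for CPU-only edge inference.
    quantized = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

    # Script and optimize for the PyTorch mobile runtime, then save in the
    # lite-interpreter format that an on-device app would load.
    example = torch.randn(1, 100, 256)  # (batch, frames, feature dim)
    scripted = torch.jit.trace(quantized, example)
    mobile = optimize_for_mobile(scripted)
    mobile._save_for_lite_interpreter("asr_encoder_quantized.ptl")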